Implement System.Buffers.Text.Base64.DecodeFromUtf8 for Arm64 #70336
Conversation
Like the AVX2 and SSE3 versions, this is based on the Aklomp base64 algorithm. The AdvSimd API does not yet have support for sequential multi-register instructions, such as TBL4/LD4/ST3. This code implements those instructions using single-register instructions. Once API support is added, this code can be greatly simplified and gain an additional performance boost.
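To illustrate the emulation (a sketch in the spirit of the PR's AdvSimdTbx* helpers, not the exact code; requires using System.Runtime.Intrinsics and System.Runtime.Intrinsics.Arm), a lookup over a 32-byte table can be built from two chained single-register TBX operations:
// Sketch: emulate a multi-register table lookup (TBX2-style) with chained single-register TBX.
// TBX leaves a lane unchanged when its index is out of range (>= 16), so subtracting 16 between
// steps walks the indices through each 16-byte slice of the table.
static Vector128<byte> Tbx2Byte(Vector128<byte> dest, Vector128<byte> table0, Vector128<byte> table1, Vector128<byte> indices)
{
    Vector128<byte> sixteen = Vector128.Create((byte)16);
    dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table0, indices);
    indices = AdvSimd.Subtract(indices, sixteen);
    dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table1, indices);
    return dest;
}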
Tagging subscribers to this area: @dotnet/area-system-memory
For this implementation I first started with the C implementation here: I took that and rewrote it in C# using the new Vector API (falling back to the old API where equivalents don't exist). Examining the assembler output, the code looks fairly decent. However, running the Base64Decode performance test on an Altra shows a 3x slowdown:
It's not immediately obvious to me why it's slower; I'm going to investigate further, given the C version was still decent. I didn't merge this code with the SSE2 and SSE3 versions because, although they are doing similar things, a combined version would just be a bunch of “if x86 else if arm” clauses with almost no commonality. As per the Aklomp code, the Arm version loads a lot more data each iteration (because it relies on a full table lookup due to the lack of a shuffle instruction), so making sure those tables are only loaded once from memory is crucial for performance.
This is the code being produced for the main loop:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void AdvSimdDecode(ref byte* srcBytes, ref byte* destBytes, byte* srcEnd, int sourceLength, int destLength, byte* srcStart, byte* destStart) |
Just curious - does AggressiveInlining here show benefits in the benchmarks?
If I remove this one it doesn't make a difference.
If I remove the two around AdvSimdTbx3Byte and AdvSimdTbx8Byte it gets 2x worse.
Those definitely make sense, while this one is a bit questionable.
// Compress each four bytes into three.
Vector128<byte> dec0 = Vector128.BitwiseOr(Vector128.ShiftLeft(str0, 2), Vector128.ShiftRightLogical(str1, 4));
Vector128<byte> dec1 = Vector128.BitwiseOr(Vector128.ShiftLeft(str1, 4), Vector128.ShiftRightLogical(str2, 2));
Vector128<byte> dec2 = Vector128.BitwiseOr(Vector128.ShiftLeft(str2, 6), str3); |
The JIT doesn't do instruction selection, so if you want your shifts to be side-by-side for better pipelining you need to extract them into temp locals, e.g.:
var sl0 = Vector128.ShiftLeft(str0, 2);
var sl1 = Vector128.ShiftLeft(str1, 4);
var sl2 = Vector128.ShiftLeft(str2, 6);
var sr1 = Vector128.ShiftRightLogical(str1, 4);
var sr2 = Vector128.ShiftRightLogical(str2, 2);
Vector128<byte> dec0 = Vector128.BitwiseOr(sl0, sr1);
Vector128<byte> dec1 = Vector128.BitwiseOr(sl1, sr2);
Vector128<byte> dec2 = Vector128.BitwiseOr(sl2, str3);
not sure it matters much in terms of perf
Copying that verbatim didn't make any difference.
Another way to do this might be to treat the vector register as a single value and do something fancy with masks and a single-value shift. Not quite sure how that would look. Might get some performance, but it'd be messy.
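One possible reading of that idea (a hypothetical, untested sketch, not code from the PR): do a single 64-bit-lane shift and mask off the bits that crossed byte boundaries. For example, the per-byte "<< 2" of str0 could become:
// Shift each 64-bit lane left by 2, then clear the low 2 bits of every byte, which are the
// bits that leaked in from the neighbouring byte. The per-byte right shifts would use the
// analogous mask (e.g. ">> 4" masks with 0x0F).
Vector128<byte> shifted = Vector128.BitwiseAnd(
    Vector128.ShiftLeft(str0.AsUInt64(), 2).AsByte(),
    Vector128.Create((byte)0xFC));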
Vector128<byte> str0 = Vector128.LoadUnsafe(ref *src);
Vector128<byte> str1 = Vector128.LoadUnsafe(ref *src, 16);
Vector128<byte> str2 = Vector128.LoadUnsafe(ref *src, 32);
Vector128<byte> str3 = Vector128.LoadUnsafe(ref *src, 48); |
I suspect you can use AdvSimd.Arm64.LoadPairVector128 (and even the non-temporal variant if it makes sense) here - the JIT is not smart enough yet to do it by itself.
Doing this worked, but didn't give any improvement.
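The exact replacement isn't shown in the thread, but presumably it looked something like this sketch (assuming the pair of vectors comes back as a tuple):
// Load 64 bytes as two LDP-style pair loads instead of four separate LD1 loads.
(Vector128<byte> str0, Vector128<byte> str1) = AdvSimd.Arm64.LoadPairVector128(src);
(Vector128<byte> str2, Vector128<byte> str3) = AdvSimd.Arm64.LoadPairVector128(src + 32);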
I compiled your code locally and here is what I've got on Main:
; Method C:AdvSimdDecode(byref,byref,long,int,int,long,long)
G_M23748_IG01:
stp fp, lr, [sp,#-16]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M23748_IG02:
ldr x3, [x0]
ldr x4, [x1]
;; size=8 bbWeight=1 PerfScore 6.00
G_M23748_IG03:
ld1 {v16.16b}, [x3]
add x5, x3, #16
ld1 {v17.16b}, [x5]
add x5, x3, #32
ld1 {v18.16b}, [x5]
add x5, x3, #48
ld1 {v19.16b}, [x5]
uzp1 v20.8h, v16.8h, v17.8h
uzp2 v21.8h, v16.8h, v17.8h
uzp1 v17.8h, v18.8h, v19.8h
uzp2 v19.8h, v18.8h, v19.8h
uzp1 v16.16b, v20.16b, v17.16b
uzp2 v17.16b, v20.16b, v17.16b
uzp1 v18.16b, v21.16b, v19.16b
uzp2 v19.16b, v21.16b, v19.16b
mvni v20.4s, #0x00
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
ldr q21, [@RWD16]
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
ldr q21, [@RWD32]
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
ldr q21, [@RWD48]
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
ldr q21, [@RWD64]
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
ldr q21, [@RWD80]
tbx v20.16b, {v21.16b}, v16.16b
ldr q21, [@RWD00]
sub v16.16b, v16.16b, v21.16b
ldr q21, [@RWD96]
tbx v20.16b, {v21.16b}, v16.16b
mov v16.16b, v20.16b
mvni v20.4s, #0x00
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
ldr q21, [@RWD16]
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
ldr q21, [@RWD32]
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
ldr q21, [@RWD48]
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
ldr q21, [@RWD64]
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
ldr q21, [@RWD80]
tbx v20.16b, {v21.16b}, v17.16b
ldr q21, [@RWD00]
sub v17.16b, v17.16b, v21.16b
ldr q21, [@RWD96]
tbx v20.16b, {v21.16b}, v17.16b
mov v17.16b, v20.16b
mvni v20.4s, #0x00
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
ldr q21, [@RWD16]
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
ldr q21, [@RWD32]
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
ldr q21, [@RWD48]
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
ldr q21, [@RWD64]
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
ldr q21, [@RWD80]
tbx v20.16b, {v21.16b}, v18.16b
ldr q21, [@RWD00]
sub v18.16b, v18.16b, v21.16b
ldr q21, [@RWD96]
tbx v20.16b, {v21.16b}, v18.16b
mov v18.16b, v20.16b
mvni v20.4s, #0x00
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v19.16b
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
mvni v21.4s, #0x00
tbx v20.16b, {v21.16b}, v19.16b
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
ldr q21, [@RWD16]
tbx v20.16b, {v21.16b}, v19.16b
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
ldr q21, [@RWD32]
tbx v20.16b, {v21.16b}, v19.16b
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
ldr q21, [@RWD48]
tbx v20.16b, {v21.16b}, v19.16b
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
ldr q21, [@RWD64]
tbx v20.16b, {v21.16b}, v19.16b
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
ldr q21, [@RWD80]
tbx v20.16b, {v21.16b}, v19.16b
;; size=552 bbWeight=8 PerfScore 1496.00
G_M23748_IG04:
ldr q21, [@RWD00]
sub v19.16b, v19.16b, v21.16b
ldr q21, [@RWD96]
tbx v20.16b, {v21.16b}, v19.16b
mov v19.16b, v20.16b
umaxp v20.16b, v16.16b, v17.16b
umaxp v21.16b, v18.16b, v19.16b
umaxp v20.16b, v20.16b, v21.16b
umov x5, v20.d[0]
tst x5, #0xd1ffab1e
bne G_M23748_IG06
;; size=44 bbWeight=8 PerfScore 96.00
G_M23748_IG05:
shl v16.16b, v16.16b, #2
ushr v20.16b, v17.16b, #4
orr v16.16b, v16.16b, v20.16b
shl v17.16b, v17.16b, #4
ushr v20.16b, v18.16b, #2
orr v17.16b, v17.16b, v20.16b
shl v18.16b, v18.16b, #6
orr v18.16b, v18.16b, v19.16b
movi v19.4s, #0x00
ldr q20, [@RWD112]
tbx v19.16b, {v16.16b}, v20.16b
ldr q20, [@RWD112]
ldr q21, [@RWD00]
sub v20.16b, v20.16b, v21.16b
tbx v19.16b, {v17.16b}, v20.16b
ldr q21, [@RWD00]
sub v20.16b, v20.16b, v21.16b
tbx v19.16b, {v18.16b}, v20.16b
st1 {v19.16b}, [x4]
movi v19.4s, #0x00
ldr q20, [@RWD128]
tbx v19.16b, {v16.16b}, v20.16b
ldr q20, [@RWD128]
ldr q21, [@RWD00]
sub v20.16b, v20.16b, v21.16b
tbx v19.16b, {v17.16b}, v20.16b
ldr q21, [@RWD00]
sub v20.16b, v20.16b, v21.16b
tbx v19.16b, {v18.16b}, v20.16b
add x5, x4, #16
st1 {v19.16b}, [x5]
movi v19.4s, #0x00
ldr q20, [@RWD144]
tbx v19.16b, {v16.16b}, v20.16b
ldr q16, [@RWD144]
ldr q20, [@RWD00]
sub v16.16b, v16.16b, v20.16b
tbx v19.16b, {v17.16b}, v16.16b
ldr q17, [@RWD00]
sub v16.16b, v16.16b, v17.16b
tbx v19.16b, {v18.16b}, v16.16b
add x5, x4, #32
st1 {v19.16b}, [x5]
add x3, x3, #64
add x4, x4, #48
cmp x3, x2
bls G_M23748_IG03
;; size=188 bbWeight=4 PerfScore 214.00
G_M23748_IG06:
str x3, [x0]
str x4, [x1]
;; size=8 bbWeight=1 PerfScore 2.00
G_M23748_IG07:
ldp fp, lr, [sp],#16
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
RWD00 dq 1010101010101010h, 1010101010101010h
RWD16 dq FFFFFFFFFFFFFFFFh, 3FFFFFFF3EFFFFFFh
RWD32 dq 3B3A393837363534h, FFFFFFFFFFFF3D3Ch
RWD48 dq 06050403020100FFh, 0E0D0C0B0A090807h
RWD64 dq 161514131211100Fh, FFFFFFFFFF191817h
RWD80 dq 201F1E1D1C1B1AFFh, 2827262524232221h
RWD96 dq 302F2E2D2C2B2A29h, FFFFFFFFFF333231h
RWD112 dq 1202211101201000h, 0524140423130322h
RWD128 dq 2717072616062515h, 1A0A291909281808h
RWD144 dq 0D2C1C0C2B1B0B2Ah, 2F1F0F2E1E0E2D1Dh
Emitting R2R PE file: aot
; Total bytes of code: 816
Don't know what is happening, but it feels like all the constants were "propagated" back to the loop body. cc @tannergooding @dotnet/jit-contrib
Ah, looks like it's CSE that decided to do so.
EgorBo - how are you running and dumping that? I was copying and pasting my code into a test app and running with COMPlus_TieredCompilation=0. It's quite possible there is something different happening when the code is in the library.
So did I. The only exception: I was using crossgen. Can you paste the whole codegen for your case (not just the loop body)?
Ahh, good :) I wasn't sure if there was a better way of doing it. This is my full dump of DecodeFromUtf8:
One of my thoughts was that loading the table was very inefficient - but increasing the text size in the test by 10x didn't improve things.
It's worth noting that when I did this in C using aklomp, I was comparing the plain C version vs the NEON C version. Once ported to C#, I'm comparing the plain C# version to the NEON C# version. If the aklomp plain C version isn't optimised and the plain C# version is optimised, that would explain why I'm not seeing any improvement going from the optimised plain C# version to the semi-optimised NEON C# version.
indicies_sub = AdvSimd.Subtract(indicies_sub, offset);
dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table6, indicies_sub);
indicies_sub = AdvSimd.Subtract(indicies_sub, offset);
dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table7, indicies_sub); |
Looking at this using perf, a lot of time is spent in this function. One reason for that is the chain of dependencies.
Splitting the indicies_sub into separate variables didn't make any noticeable difference:
var indicies1 = AdvSimd.Subtract(indicies, Vector128.Create((byte)(16U*1)));
var indicies2 = AdvSimd.Subtract(indicies, Vector128.Create((byte)(16U*2)));
var indicies3 = AdvSimd.Subtract(indicies, Vector128.Create((byte)(16U*3)));
etc
The TBXs could be split out:
increment everything in the lookup table so that it starts from 1. Do the lookups using TBLs, meaning failures are 0. Combine all the results with ORs. Then subtract 1 from the result.
That would add more complexity, and I very much doubt it's going to give much benefit overall.
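For illustration, that TBL/OR variant might look roughly like the sketch below (hypothetical code, shown with only three table slices; the real decode table spans eight 16-byte slices, and this is not code from the PR):
// Each lookup table slice has 1 pre-added to every entry. TBL returns 0 for out-of-range
// indices, so a miss in every slice ORs to 0 and the final subtract turns it into 0xFF
// (invalid), while a hit in exactly one slice yields value+1 and the subtract restores value.
static Vector128<byte> TblOrLookup(Vector128<byte> indices, Vector128<byte> lutPlusOne0,
                                   Vector128<byte> lutPlusOne1, Vector128<byte> lutPlusOne2)
{
    Vector128<byte> sixteen = Vector128.Create((byte)16);
    Vector128<byte> r0 = AdvSimd.Arm64.VectorTableLookup(lutPlusOne0, indices);
    indices = AdvSimd.Subtract(indices, sixteen);
    Vector128<byte> r1 = AdvSimd.Arm64.VectorTableLookup(lutPlusOne1, indices);
    indices = AdvSimd.Subtract(indices, sixteen);
    Vector128<byte> r2 = AdvSimd.Arm64.VectorTableLookup(lutPlusOne2, indices);
    return AdvSimd.Subtract(AdvSimd.Or(AdvSimd.Or(r0, r1), r2), Vector128.Create((byte)1));
}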
Without use of LD4/ST3/TBX4, I very much doubt we can get this above the performance of the existing code. There are some remaining ideas above, but it's very doubtful they will be enough to go from 3x slower to faster, and they'll just add code complexity. I'll leave this PR open for a bit longer in case anyone has any good ideas.
@a74nh could you elaborate on why the Arm64 implementation can't basically be a 1-to-1 port of the x86/x64 logic? There isn't really anything in I do understand using
I don't think that is the case, and I agree with @tannergooding about trying to map it to SSE3Decode to get an initial perf boost. Just a long shot: can we spread out the dependencies so unrelated loads can happen in parallel (likewise doing 4-byte unrelated loads in batches)?
Vector128<short> tmp0 = AdvSimd.Arm64.UnzipEven(str0.AsInt16(), str1.AsInt16());
- Vector128<short> tmp1 = AdvSimd.Arm64.UnzipOdd(str0.AsInt16(), str1.AsInt16());
Vector128<short> tmp2 = AdvSimd.Arm64.UnzipEven(str2.AsInt16(), str3.AsInt16());
- Vector128<short> tmp3 = AdvSimd.Arm64.UnzipOdd(str2.AsInt16(), str3.AsInt16());
str0 = AdvSimd.Arm64.UnzipEven(tmp0.AsByte(), tmp2.AsByte());
str1 = AdvSimd.Arm64.UnzipOdd(tmp0.AsByte(), tmp2.AsByte());
- str2 = AdvSimd.Arm64.UnzipEven(tmp1.AsByte(), tmp3.AsByte());
- str3 = AdvSimd.Arm64.UnzipOdd(tmp1.AsByte(), tmp3.AsByte());
-
- // Table lookup on each 16 bytes.
str0 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str0, v16);
str1 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str1, v16);
+
+ Vector128<short> tmp1 = AdvSimd.Arm64.UnzipOdd(str0.AsInt16(), str1.AsInt16());
+ Vector128<short> tmp3 = AdvSimd.Arm64.UnzipOdd(str2.AsInt16(), str3.AsInt16());
+ str2 = AdvSimd.Arm64.UnzipEven(tmp1.AsByte(), tmp3.AsByte());
+ str3 = AdvSimd.Arm64.UnzipOdd(tmp1.AsByte(), tmp3.AsByte());
str2 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str2, v16);
str3 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str3, v16);
Let me check this...
@tannergooding: looks like this might be workable. However, I'm just a little stumped on the MultiplyAddAdjacent equivalent. You suggested "Unzip + Multiply + AddPairwiseWidening" - I'm not quite sure exactly how you were doing that. I'm assuming you're thinking of something special here and taking the mergeConstant0 into account (instead of just doing a full MultiplyAddAdjacent in NEON instructions).
I was simply giving a naïve equivalent. Notably including
tmp[0] = left[0] * right[0];
tmp[1] = left[1] * right[1];
tmp[2] = left[2] * right[2];
tmp[3] = left[3] * right[3];
res[0] = widen(tmp[0]) + widen(tmp[1]);
res[1] = widen(tmp[2]) + widen(tmp[3]);
So this is just
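For reference, one way that naïve equivalent could be spelled with AdvSimd intrinsics for the short-to-int case (an illustration, not necessarily the exact Unzip-based sequence meant above):
// Widen-multiply the lower and upper halves, then pairwise-add so adjacent products are summed,
// matching Sse2.MultiplyAddAdjacent(Vector128<short>, Vector128<short>).
static Vector128<int> MultiplyAddAdjacent(Vector128<short> left, Vector128<short> right)
{
    Vector128<int> lo = AdvSimd.MultiplyWideningLower(left.GetLower(), right.GetLower());
    Vector128<int> hi = AdvSimd.MultiplyWideningUpper(left, right);
    return AdvSimd.Arm64.AddPairwise(lo, hi);
}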
Ahh, good, that's what I was thinking already. My first version using the SSE3 approach is looking more promising - it's only 1.28x slower, instead of 3.5x.
Will see if I can improve it a bit more...
This is 1.28x slower than the scalar implementation?
Yes. I was comparing against a build with no changes. So it's quite a bit faster than the NEON version with all the TBX instructions.
Yeah, but it's surprising that it's slower than the scalar version in this case. I'd be interested in the C# code and the disassembly if you have it :)
It looks to me like the microbenchmark for DecodeFromUtf8 isn't quite right: _encodedBytes is filled with random byte values, which means some of them will be invalid base64 values. That will cause the SIMD function to exit early with an invalid value here:
Instead, look at the unit tests:
Those use InitializeDecodableBytes to create an array of valid base64 values:
I've rewritten the microbenchmark to use a copy/pasted version of InitializeDecodableBytes.
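For context, an InitializeDecodableBytes-style helper is roughly the sketch below (assumed shape with a hypothetical seed, not the test suite's exact code):
// Fill the buffer with bytes from the base64 alphabet so the benchmark decodes valid input
// and the SIMD path is exercised instead of bailing out on the first invalid byte.
// Assumes a length that is a multiple of 4 with no '=' padding.
static void InitializeDecodableBytes(Span<byte> bytes, int seed = 42)
{
    const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    var rnd = new Random(seed);
    for (int i = 0; i < bytes.Length; i++)
    {
        bytes[i] = (byte)Alphabet[rnd.Next(Alphabet.Length)];
    }
}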
I now have two approaches to base64 decode:
NEON specific: (this patch)
Vector128 combined: (#70654)
Ultimately, the NEON-specific version is the way to go for pure performance. Once it's decided which version to use, I can then look at Encode (which has exactly the same choices to make as this function). I'll also look at posting my fixes to the performance suite.
I think the Vector128 combined version (#70654) is the better option for now. When the JIT gets LD4 support, we can look at whether things can be improved. We can likewise look at improving the JIT to automatically fold sequential loads into LD2/3/4 and sequential stores into ST2/3/4.
I'm coming to this conclusion too. For .NET 7 it is the neater and easier code. I'll leave this request open for a day and then close it if there are no objections.
Thanks @a74nh for finding that out. Could you please also send a PR to the performance repo?
Will do once I fix up the Encode tests too.
Closing this pull request in favour of #70654 |